DAC: The Double Actor-Critic Architecture for Learning Options
We reformulate the option framework as two parallel augmented MDPs. Under
this novel formulation, all policy optimization algorithms can be used off the
shelf to learn intra-option policies, option termination conditions, and a
master policy over options. We apply an actor-critic algorithm on each
augmented MDP, yielding the Double Actor-Critic (DAC) architecture.
Furthermore, we show that, when state-value functions are used as critics, one
critic can be expressed in terms of the other, and hence only one critic is
necessary. We conduct an empirical study on challenging robot simulation tasks.
In a transfer learning setting, DAC outperforms both its hierarchy-free
counterpart and previous gradient-based option learning algorithms.
Comment: NeurIPS 2019
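To make the formulation concrete, below is a minimal tabular sketch of the two policies that DAC factorizes the option framework into: the high-level augmented MDP's policy, which mixes option termination with the master policy, and the low-level augmented MDP's policy, which is simply the intra-option policy. All shapes, parameters, and the stand-in environment are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Toy tabular sketch of DAC's two augmented policies; everything here
# (sizes, random parameters, environment) is illustrative only.

n_states, n_options, n_actions = 5, 3, 2
rng = np.random.default_rng(0)

master_logits = rng.normal(size=(n_states, n_options))            # master policy over options
beta = rng.uniform(size=(n_states, n_options))                    # per-option termination probs
intra_logits = rng.normal(size=(n_options, n_states, n_actions))  # intra-option policies

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def high_policy(s, o_prev):
    # Policy of the high-level augmented MDP on state (s, o_prev):
    # with prob 1 - beta the previous option continues; otherwise the
    # master policy re-selects an option.
    probs = beta[s, o_prev] * softmax(master_logits[s])
    probs[o_prev] += 1.0 - beta[s, o_prev]
    return rng.choice(n_options, p=probs)

def low_policy(s, o):
    # Policy of the low-level augmented MDP on state (s, o): the intra-option policy.
    return rng.choice(n_actions, p=softmax(intra_logits[o, s]))

s, o = 0, 0
for _ in range(5):
    o = high_policy(s, o)        # one step of the high MDP
    a = low_policy(s, o)         # one step of the low MDP
    s = rng.integers(n_states)   # stand-in environment transition
    print(f"option={o}, action={a}, next state={s}")
```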
Generalized Off-Policy Actor-Critic
We propose a new objective, the counterfactual objective, unifying existing
objectives for off-policy policy gradient algorithms in the continuing
reinforcement learning (RL) setting. Compared to the commonly used excursion
objective, which can be misleading about the performance of the target policy
when deployed, our new objective better predicts such performance. We prove the
Generalized Off-Policy Policy Gradient Theorem to compute the policy gradient
of the counterfactual objective and use an emphatic approach to get an unbiased
sample from this policy gradient, yielding the Generalized Off-Policy
Actor-Critic (Geoff-PAC) algorithm. We demonstrate the merits of Geoff-PAC over
existing algorithms in MuJoCo robot simulation tasks, the first empirical
success of emphatic algorithms in prevailing deep RL benchmarks.
Comment: NeurIPS 2019
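As a toy numeric illustration of why the excursion objective can mislead (this shows only the underlying distribution mismatch, not the paper's counterfactual objective or its emphatic sampling): performance weighted by the behavior policy's stationary distribution can differ sharply from performance weighted by the target policy's own distribution. The two chains and the reward below are made up.

```python
import numpy as np

# Toy illustration (not the paper's construction): weighting per-state
# reward by the behavior policy's stationary distribution d_mu can badly
# mispredict performance under the target policy's own distribution d_pi.

P_mu = np.array([[0.9, 0.1], [0.5, 0.5]])  # state chain induced by behavior policy mu
P_pi = np.array([[0.5, 0.5], [0.1, 0.9]])  # state chain induced by target policy pi
r = np.array([1.0, 0.0])                   # per-state reward

def stationary(P):
    # Left Perron eigenvector of P, normalized to a probability distribution.
    w, v = np.linalg.eig(P.T)
    d = np.real(v[:, np.argmax(np.real(w))])
    return d / d.sum()

d_mu, d_pi = stationary(P_mu), stationary(P_pi)
print("excursion-style estimate:", d_mu @ r)  # ~0.83: where mu spends time
print("deployment performance:  ", d_pi @ r)  # ~0.17: where pi actually lives
```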
Deep Residual Reinforcement Learning
We revisit residual algorithms in both model-free and model-based
reinforcement learning settings. We propose the bidirectional target network
technique to stabilize residual algorithms, yielding a residual version of DDPG
that significantly outperforms vanilla DDPG in the DeepMind Control Suite
benchmark. Moreover, we find the residual algorithm an effective approach to
the distribution mismatch problem in model-based planning. Compared with the
existing TD(k) method, our residual-based method makes weaker assumptions
about the model and yields a greater performance boost.
Comment: AAMAS 2020
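For context, a residual algorithm differentiates the Bellman error through both sides of the temporal difference, instead of treating the bootstrap target as a constant. The sketch below shows this Baird-style mixing of the semi-gradient and residual-gradient directions under linear function approximation; it is an illustration of the underlying idea only, and does not reproduce the paper's bidirectional target network or its DDPG variant. The mixing coefficient eta is an illustrative assumption.

```python
import numpy as np

# Generic residual-algorithm sketch with linear values (Baird-style mix of
# semi-gradient and residual-gradient directions); illustrative only.

rng = np.random.default_rng(0)
n_features, gamma, alpha, eta = 4, 0.99, 0.1, 0.5
w = np.zeros(n_features)

def update(phi, r, phi_next, done):
    global w
    delta = r + (1.0 - done) * gamma * (phi_next @ w) - phi @ w
    # Semi-gradient treats the bootstrap target as a constant ...
    semi = -delta * phi
    # ... while the residual gradient also differentiates through phi_next.
    residual = -delta * (phi - (1.0 - done) * gamma * phi_next)
    w -= alpha * (eta * residual + (1.0 - eta) * semi)

# One illustrative transition with random features.
update(rng.normal(size=n_features), 1.0, rng.normal(size=n_features), 0.0)
print(w)
```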
Direct Gradient Temporal Difference Learning
Off-policy learning enables a reinforcement learning (RL) agent to reason
counterfactually about policies that are not executed and is one of the most
important ideas in RL. It, however, can lead to instability when combined with
function approximation and bootstrapping, two arguably indispensable
ingredients for large-scale reinforcement learning. This is the notorious
deadly triad. Gradient Temporal Difference (GTD) learning is a powerful tool for
resolving the deadly triad. Its success comes from solving the double sampling
issue indirectly, via weight duplication or Fenchel duality. In this paper, we
instead propose a direct method to solve the double sampling issue by simply
using two samples in a Markovian data stream with an increasing gap. The
resulting algorithm is as computationally efficient as GTD but gets rid of
GTD's extra weights. The only price we pay is a logarithmically increasing
memory as time progresses. We provide both asymptotic and finite-sample
analyses, in which the convergence rate is on par with that of canonical
on-policy temporal difference learning. Key to our analysis is a novel refined
discretization of the limiting ODEs.
Comment: Submitted to JMLR in Apr 202
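A hedged sketch of the two-sample idea, applied here to the norm of the expected TD update (NEU), an objective whose gradient is a product of two expectations (the double sampling issue): each factor is estimated from samples taken an increasing gap apart in the Markovian stream. The gap schedule, feature process, and reward are illustrative assumptions, and the full history is logged for clarity even though the paper needs only logarithmically growing memory.

```python
import numpy as np

# Two-sample sketch for J(w) = 0.5 * ||E[delta * phi]||^2: one sample of the
# Jacobian factor comes from time t, one sample of the expected-update factor
# from time t - g, with the gap g growing over time so the pair is
# asymptotically independent under mixing. Illustrative assumptions throughout.

rng = np.random.default_rng(0)
n, gamma, alpha = 4, 0.9, 0.01
w = np.zeros(n)
history = []  # full log for clarity; the paper keeps only O(log t) memory

def td_error(phi, r, phi_next):
    return r + gamma * (phi_next @ w) - phi @ w

phi = rng.normal(size=n)
for t in range(1, 10_000):
    phi_next = 0.5 * phi + 0.5 * rng.normal(size=n)  # stand-in Markovian features
    r = phi[0]                                       # stand-in reward
    history.append((phi, r, phi_next))
    g = max(1, int(np.log2(t + 1)))                  # increasing gap between samples
    if t > g:
        phi_s, r_s, phi_s_next = history[t - 1 - g]
        # (gamma*phi' - phi) phi^T from time t, times delta*phi from time t - g.
        grad = (gamma * phi_next - phi) * (phi @ (td_error(phi_s, r_s, phi_s_next) * phi_s))
        w -= alpha * grad
    phi = phi_next
print(w)
```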
GradientDICE: Rethinking Generalized Offline Estimation of Stationary Values
We present GradientDICE for estimating the density ratio between the state
distribution of the target policy and the sampling distribution in off-policy
reinforcement learning. GradientDICE fixes several problems of GenDICE (Zhang
et al., 2020), the state-of-the-art for estimating such density ratios. Namely,
once nonlinearity is introduced into the parameterization of the optimization
variables to ensure positivity, the optimization problem in GenDICE is no longer
a convex-concave saddle-point problem, so no primal-dual algorithm is guaranteed
to converge to or find the desired solution. However, such nonlinearity is
essential to ensure the consistency of GenDICE, even with a tabular
representation. This fundamental contradiction results from GenDICE's original
formulation of the optimization problem.
a different objective from GenDICE by using the Perron-Frobenius theorem and
eliminating GenDICE's use of divergence. Consequently, nonlinearity in
parameterization is not necessary for GradientDICE, which is provably
convergent under linear function approximation.
Comment: ICML 2020
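For intuition about the estimand, the tabular toy below computes exactly the density ratio that such methods learn, by solving the Bellman flow equation for the target policy's discounted state distribution; it illustrates what GradientDICE estimates, not its optimization. All kernels and distributions are made up.

```python
import numpy as np

# Tabular toy: the quantity to be estimated is tau(s) = d_pi(s) / d_mu(s),
# the ratio between the target policy's discounted state distribution and
# the sampling distribution. Here we solve for it exactly.

n, gamma = 3, 0.9
rng = np.random.default_rng(0)
P_pi = rng.dirichlet(np.ones(n), size=n)  # state-to-state kernel under pi (rows sum to 1)
mu0 = np.ones(n) / n                      # initial state distribution
d_mu = rng.dirichlet(np.ones(n))          # sampling distribution of the dataset

# Bellman flow: d_pi = (1 - gamma) * mu0 + gamma * P_pi^T d_pi
d_pi = np.linalg.solve(np.eye(n) - gamma * P_pi.T, (1 - gamma) * mu0)
tau = d_pi / d_mu                         # the density ratio to be learned
print("tau =", tau)
print("E_{d_mu}[tau] =", d_pi.sum())      # normalization: equals 1
```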
Breaking the Deadly Triad with a Target Network
The deadly triad refers to the instability of a reinforcement learning
algorithm when it employs off-policy learning, function approximation, and
bootstrapping simultaneously. In this paper, we investigate the target network
as a tool for breaking the deadly triad, providing theoretical support for the
conventional wisdom that a target network stabilizes training. We first propose
and analyze a novel target network update rule which augments the commonly used
Polyak-averaging style update with two projections. We then apply the target
network and ridge regularization in several divergent algorithms and show their
convergence to regularized TD fixed points. Those algorithms are off-policy
with linear function approximation and bootstrapping, spanning both policy
evaluation and control, as well as both discounted and average-reward settings.
In particular, we provide the first convergent linear Q-learning algorithms
under nonrestrictive and changing behavior policies without bi-level
optimization.
Comment: ICML 2021
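A hedged sketch of the flavor of update rule the abstract describes: Polyak-style averaging toward the online weights followed by a projection, shown here as a single l2-ball projection for simplicity. The radius and step size are illustrative assumptions, and the paper's two projections may take a different form.

```python
import numpy as np

# Polyak-averaging target update augmented with a projection onto an l2
# ball, keeping the target weights in a bounded set. Illustrative only.

def project_ball(theta, radius):
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def target_update(theta_target, theta, tau=0.01, radius=10.0):
    mixed = (1.0 - tau) * theta_target + tau * theta  # Polyak averaging
    return project_ball(mixed, radius)                # projection step

theta_target = np.zeros(3)
theta = np.array([100.0, 0.0, 0.0])                   # online weights
theta_target = target_update(theta_target, theta)     # moves slowly, stays bounded
print(theta_target)
```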